Web Robot Detection in Academic Publishing

Authors

  • Athanasios Lagopoulos
  • Grigorios Tsoumakas
  • Georgios Papadopoulos
Abstract

Recent industry reports confirm the rise of web robots, which now account for more than half of total web traffic. They not only threaten the security, privacy and efficiency of the web, but also distort analytics and metrics, casting doubt on the veracity of the information being promoted. In the academic publishing domain, this can cause articles to be falsely presented as prominent and influential. In this paper, we present our approach to detecting web robots on academic publishing websites. We use different supervised learning algorithms with a variety of features derived from both the server's log files and the content served by the website. Our approach relies on the assumption that human users will be interested in specific domains or articles, while web robots crawl a web library incoherently. We experiment with features adopted in previous studies, adding novel semantic features derived from a semantic analysis using the Latent Dirichlet Allocation (LDA) algorithm. Our real-world case study shows promising results, highlighting the significance of semantic features in the web robot detection problem.
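
The assumption above (humans stay within a few topics, robots wander across many) can be illustrated with a minimal sketch. It is not the authors' feature set: the abstract does not list the actual semantic features, so the two features below (entropy of a session's average topic distribution and the share of its dominant topic) are hypothetical, as are the article IDs and the topic vectors, which are assumed to have been produced beforehand by an LDA model.

```python
import math
from typing import Dict, List, Sequence


def mean_topic_distribution(topic_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-article topic distributions of one session."""
    n_topics = len(topic_vectors[0])
    n_articles = len(topic_vectors)
    return [sum(v[k] for v in topic_vectors) / n_articles for k in range(n_topics)]


def topic_entropy(distribution: Sequence[float]) -> float:
    """Shannon entropy in bits; zero-probability topics are skipped."""
    return -sum(p * math.log2(p) for p in distribution if p > 0)


def session_semantic_features(
    session_articles: Sequence[str],
    article_topics: Dict[str, Sequence[float]],
) -> Dict[str, float]:
    """Hypothetical semantic features for one session (list of requested articles)."""
    vectors = [article_topics[a] for a in session_articles]
    mean_dist = mean_topic_distribution(vectors)
    return {
        "topic_entropy": topic_entropy(mean_dist),    # low = topically focused
        "dominant_topic_share": max(mean_dist),       # high = topically focused
    }


# Hypothetical LDA output: one distribution over 3 topics per article.
article_topics = {
    "a1": [0.90, 0.05, 0.05],
    "a2": [0.80, 0.10, 0.10],
    "b1": [0.05, 0.90, 0.05],
    "c1": [0.05, 0.05, 0.90],
}

focused = session_semantic_features(["a1", "a2"], article_topics)          # human-like
scattered = session_semantic_features(["a1", "b1", "c1"], article_topics)  # crawler-like
print(focused, scattered)
```

In this toy data the focused session has lower topic entropy than the scattered one. Features of this kind would be appended to the log-based features before training a supervised classifier.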

Similar resources

Motion detection by a moving observer using Kalman filter and neural network in soccer robot

In many autonomous mobile applications, robots must be capable of analyzing motion of moving objects in their environment. During movement of robot the quality of images is affected by quakes of camera which cause high errors in image processing outputs. In this paper, we propose a novel method to effectively overcome this problem using Neural Networks and Kalman Filtering theory. This technique u...

The Ontogenesis Knowledgeblog: Lightweight publishing about semantics, with lightweight semantic publishing

The web has moved from a minority interest tool to one of the most heavily used platforms for publication. Despite originally being designed by and for academics, it has left academic publishing largely untouched; most papers are available on-line, but in PDF and are most easily read once printed. Here, we describe our experiments with using commodity web technology to replace the existing publ...

A framework for statistical software development, maintenance, and publishing within an open-access business model

There are several fundamental problems with statistical software development in the academic community. In addition, the development and dissemination of academic software will become increasingly difficult due to a variety of reasons. To solve these problems, a new framework for statistical software development, maintenance, and publishing is proposed: it is based on the paradigm that academic...

Development of RadRob15, A Robot for Detecting Radioactive Contamination in Nuclear Medicine Departments

Accidental or intentional release of radioactive materials into the living or working environment may cause radioactive contamination. In nuclear medicine departments, radioactive contamination is usually due to radionuclides which emit high energy gamma photons and particles. These radionuclides have a broad range of energies and penetration capabilities. Rapid detection of radioactive contami...

Guidelines for selecting journals that avoid fraudulent practices in scholarly publishing

In recent years, scholarly publishing has been faced with many distractive phenomena. Generally, most researchers are unaware of fraudulent practices now common to scholarly publishing and are at risk of becoming a victim of them. Editors also need to have sufficient knowledge about these practices. There are papers that try to increase awareness of authors about fraud in scholarly publishing, ...

Journal title:
  • CoRR

Volume: abs/1711.05098

Publication date: 2017